Machine Learning / Aprendizagem Automática

Ruben Branco, Eduardo Castanho, Sofia Teixeira, Sara C. Madeira, and Luís Correia, 2021/22

ML Project - Learning about Hit Songs - Team 33

Introduction

Let's imagine that we manage a radio station, HIT FM! Suppose we want to broadcast only hit songs; unsuccessful tracks are not welcome. So, what kind of music, and from which artists, should we accept? Which levels of loudness, energy or instrumentalness should a song reach to be accepted in our broadcasts?

image.png

Statistics and Machine Learning are the best tools to answer those questions. With this goal, the radio has contracted 4 data scientists to develop a series of models to better understand what hits and flops look like and which characteristics set them apart.

The first task consists in predicting two target variables: Hit and Valence. Hit marks whether a track is a success, as opposed to a flop, a song with no repercussion. Valence indicates the positiveness of the track and was split into three categories: Negative, Neutral and Positive.

HIT FM! is a very exclusive radio station, so it does not settle for a single prediction: it requires 4 models for each target variable:

On top of that, to be sure, it wants to characterize its tracks through unsupervised models: not one cluster analysis but two, plus association rule mining.

The selected team is composed of:

The four analysts decided that the best models to answer the prediction requests are:

For the second task, the chosen classifiers are:

From here the team starts building its prediction and classification solutions, first of all importing all the necessary Python libraries.

Libraries import

Task 0 (Know your Data) - Exploratory Data Analysis

0.1 Loading Data

0.2. Understanding Data

An initial analysis is carried out to better understand the data: looking for missing values (NaN), duplicate entries and errors/typos, and deciding how to deal with them if found.
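The basic checks above can be sketched in a couple of pandas calls; the mini-frame below is hypothetical, standing in for the tracks dataset:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the tracks dataset.
df = pd.DataFrame({
    "track": ["A", "B", "B", "C"],
    "energy": [0.7, 0.5, 0.5, None],
})

n_missing = int(df.isna().sum().sum())   # total NaN cells across all columns
n_dup = int(df.duplicated().sum())       # rows identical on every column
```

On this toy frame both counts come out as 1: one missing `energy` value and one fully duplicated row.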

Data Fields Description

Initial insights

The features were classified into labels according to their type:

Numerical:

Categorical:

Meta Data:

Check for errors/typos

The team searches for unexpected values in integer/object columns (except for those considered meta data):

No unexpected values were found.

Check for duplicates

There are no duplicates considering all columns.

The team found multiple rows carrying duplicate information. Defining exactly what counts as a duplicate was not easy, because not all duplicate rows were exact copies. In some cases only the track name was repeated, with a different artist or different values in variables such as liveness; those cases were treated as different versions of the same song, i.e. different tracks. Thus, the team analysed and removed the duplicated rows in three rounds, sequentially considering all columns except track, artist and uri.
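One round of this deduplication can be sketched with `drop_duplicates` on a subset of columns; the rows below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "track":  ["Song X", "Song X", "Song Y"],
    "artist": ["A1",     "A2",     "A3"],
    "uri":    ["u1",     "u2",     "u3"],
    "energy": [0.8,      0.8,      0.3],
    "tempo":  [120.0,    120.0,    90.0],
})

# Rows identical on every audio feature are treated as the same recording,
# even when the artist or URI differs.
feature_cols = [c for c in df.columns if c not in ("track", "artist", "uri")]
deduped = df.drop_duplicates(subset=feature_cols)
```

Here the two copies of "Song X" collapse into one, leaving two rows.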

Below are some examples of duplicates.

0.3. Data Cleaning and Preprocessing

Remove duplicates

Check for outliers

The next step is the search for outliers. For that we have drawn a boxplot for every variable, focusing on the possible explanatory variables, such as speechiness, acousticness or energy.
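The percentile check behind the discussion below can be sketched with NumPy alone, on synthetic tempo values (the real analysis uses the dataset's columns):

```python
import numpy as np

rng = np.random.default_rng(0)
tempo = rng.normal(loc=120, scale=25, size=1000)   # synthetic tempo values

# Flag values outside the 5th-95th percentile band, as examined in the boxplots.
p5, p95 = np.percentile(tempo, [5, 95])
n_extreme = int(((tempo < p5) | (tempo > p95)).sum())
```

By construction about 10% of the values fall outside the band; flagging them is a prerequisite to deciding, as the team does below, whether they are genuine extremes or bad data.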

image.png width="400"

image.png

image.png

image.png

The most striking situation regarding extreme values concerns loudness and tempo, which have several values below the 5th and above the 95th percentiles. In any case, we decided not to remove or cap them, because the differences were not large, and a very fast or very slow tempo is not necessarily wrong data.

Besides, speechiness, instrumentalness and liveness have multiple extreme values very far from the median and the 95th percentile. However, since the domain of these variables is [0, 1] and the values fall inside it, we considered them normal values.

Finally, data cleaning deleted around 2500 rows, due to the duplicate removal.

0.4 Exploratory and Descriptive Analysis

One of the most important needs when modelling is to know your data:

This overview and other details can be analyzed thanks to the report generated with the pandas_profiling library.

Next, we comment on some highlights, but the reader can go deeper into the descriptive analysis by downloading the HTML report ☞ https://drive.google.com/file/d/1rFFtYCAnTQ5vyTBPEsO3ZX7Ji30tANHd/view?usp=sharing
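For readers without the profiling library, the core statistics it reports can be previewed with pandas alone (the one-column frame below is hypothetical):

```python
import pandas as pd

# Full report, as in the project (assuming pandas_profiling is installed):
#   from pandas_profiling import ProfileReport
#   ProfileReport(df).to_file("report.html")
# A quick numeric preview of similar per-feature statistics with pandas alone:
df = pd.DataFrame({"danceability": [0.2, 0.5, 0.8, 0.6]})
summary = df.describe()   # count, mean, std, min, quartiles, max
```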

image.png

Distribution of Danceability variable (above). Distribution of Decade (below).

image.png

Next, the interaction between energy and danceability. As we can see, the higher the energy, the higher the danceability, but only for energy values below approximately 0.8.

image.png

Interaction between energy and loudness. There seems to be a positive linear correlation between them.

image.png

Correlation analysis

Inspired by the last image, the team computed the correlation matrix. This analysis calculates the Pearson linear correlation coefficient, which measures the linear proportionality between pairs of variables. Its values range from -1 to 1, with zero meaning a complete absence of linear relation between the two variables. A coefficient close to 1 indicates that both features are directly related, i.e. one increases when the other does, while a correlation near -1 indicates a negative relationship.

From 0.6 in absolute value, the correlation between two variables is considered high, and strong when above 0.75. Looking at the table above, the strongest correlations are:

Overall there are only 3 strong relationships, which is not many considering that we tested more than 10 pairs (the output shows only the most important). Correlation coefficients give us hints of relative importance, but we cannot draw firm conclusions from this analysis alone; for that we have to wait for the supervised and unsupervised learning algorithms.

The next plot gives another view of the correlations: the main diagonal shows the histogram of each variable, and off the diagonal we have scatter plots between pairs of variables.
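The correlation matrix itself is one pandas call; the two-column frame below is a hypothetical stand-in for the full numeric dataset:

```python
import pandas as pd

# Tiny hypothetical sample; the report computes this over all numeric features.
df = pd.DataFrame({
    "energy":   [0.1, 0.4, 0.6, 0.9],
    "loudness": [-20.0, -12.0, -8.0, -4.0],
})
corr = df.corr(method="pearson")                     # pairwise Pearson coefficients
strong = (corr.abs() >= 0.75) & (corr.abs() < 1.0)   # flag strong off-diagonal pairs
```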

0.6 Feature Selection

One of the goals of our work is to predict, before playing a new song on the radio, whether it will be a hit or a flop. In the future we will feed the classification model with the new track's characteristics, such as instrumentalness or danceability, and the model will return the probability of it being a Hit. If this probability is higher than some threshold, the team will recommend broadcasting the track.

Our work as data analysts is to define which of those variables are significant, and therefore which ones should be fed into the model to calculate the Hit probability. The second task is to determine the threshold from which a song is classified as Hit rather than Flop; the cut-off computation is discussed below. To predict the discrete valence of the tracks, we follow the same logic.

First, we built a function that returns the most significant variables and discards those that do not add relevant information.

The function forward_selectedCV has 4 parameters: the train set, the target variable and 2 penalty parameters to correct the coefficients if necessary.

The function works as follows: it tests each variable univariately against the target and records the training accuracy. After evaluating every predictor, it keeps the best one. In the second iteration, the function computes the accuracy of models containing the first selected variable plus each remaining variable, one at a time. The process repeats until the remaining variables no longer improve the model accuracy; the selected variables are then considered definitive.
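The loop just described can be sketched as greedy forward selection with cross-validation. This is not the project's forward_selectedCV (whose penalty parameters are omitted here); it is a minimal sketch on toy data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, tol=0.01):
    """Greedy forward selection: at each step add the feature that most
    improves mean CV accuracy; stop when no candidate beats it by > tol."""
    remaining, selected, best = list(X.columns), [], 0.0
    while remaining:
        scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[selected + [f]], y, cv=3).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] - best <= tol:
            break
        selected.append(f_best)
        best = scores[f_best]
        remaining.remove(f_best)
    return selected

# Toy data: 'signal' determines the target, 'noise' is irrelevant.
rng = np.random.default_rng(1)
X = pd.DataFrame({"signal": rng.normal(size=200),
                  "noise":  rng.normal(size=200)})
y = (X["signal"] > 0).astype(int)
chosen = forward_select(X, y)
```

On this toy data the informative feature is picked first and the noise feature is rejected, mirroring how uninformative predictors are discarded.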

Task 1 (Supervised Learning) - Predicting Hit or Flop and Discrete Valence

1.1. Preprocessing Data for Classification

The team starts by adding the feature discrete valence to the dataset, which amounts to binning the feature valence into three classes: Negative, Neutral and Positive.
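The binning step can be sketched with `pd.cut`; the cut points at 1/3 and 2/3 are hypothetical, standing in for whichever thresholds the project actually uses:

```python
import pandas as pd

# Hypothetical cut points at 1/3 and 2/3; the report bins valence into
# Negative / Neutral / Positive.
valence = pd.Series([0.05, 0.40, 0.55, 0.90])
discrete_valence = pd.cut(valence, bins=[0.0, 1/3, 2/3, 1.0],
                          labels=["Negative", "Neutral", "Positive"],
                          include_lowest=True)
```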

Labelling the features by data types

Further preprocessing steps will be performed as required for each classifier.

1.1.1. Logistic Regression

Predictors in logistic regression can be discrete or continuous. Originally our independent variables are continuous; we discretized them to gain human interpretability and manageability, despite the loss of information this can induce.

The way we create the categories is founded on univariate analysis, crossing Hit (for example) against each predictor variable. From this analysis we obtain two indicators: Weight of Evidence (WoE) and Information Value (IV).

Weight of Evidence is widely used to measure the separation between the two levels of the dependent variable. It computes the logarithm of the ratio between the percentage of events and the percentage of non-events in each category. Negative values indicate that in that class there are more non-events than events (because the internal ratio is lower than one, so its logarithm is negative). In conclusion, the higher the absolute value of this indicator, the greater the discriminant importance of the category.

image.png

On the other hand, we have the Information Value, which is related to the WoE and is more frequently used to assess the importance of an entire variable or model rather than individual categories.

image.png

The team developed a function that calculates these indicators and proposes new classes.

image.png

Above, the output for Danceability.

The function returns several columns, of which the most important are the Min and Max values, the event rates, and the WoE and IV.

It produces many bins for each variable, which we then regrouped into 4-5 groups per variable; otherwise interpretability would be too hard. To optimize the creation of the classes we joined categories with similar event rates and WoE.
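The two indicators follow the standard formulas WoE_i = ln(%events_i / %non-events_i) and IV = Σ_i (%events_i − %non-events_i) · WoE_i. A minimal sketch (not the project's function, and on hypothetical data):

```python
import numpy as np
import pandas as pd

def woe_iv(binned, target):
    # WoE_i = ln(%events_i / %non-events_i);
    # IV = sum over bins of (%events_i - %non-events_i) * WoE_i.
    tab = pd.crosstab(binned, target)     # rows: bins, columns: 0/1
    pct_event = tab[1] / tab[1].sum()     # share of events (Hits) per bin
    pct_non   = tab[0] / tab[0].sum()     # share of non-events (Flops) per bin
    woe = np.log(pct_event / pct_non)
    iv = ((pct_event - pct_non) * woe).sum()
    return woe, iv

# Hypothetical binned predictor vs. a Hit/Flop target.
bins = pd.Series(["low", "low", "low", "high", "high", "high"])
hit  = pd.Series([0, 0, 1, 0, 1, 1])
woe, iv = woe_iv(bins, hit)
```

Here the "high" bin concentrates events (positive WoE) and "low" concentrates non-events (negative WoE), exactly the separation the text describes.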

Next, the categorization process.

1.1.2. Naïve Bayes

These classifiers are built on statistical and probabilistic concepts. Naïve Bayes classifiers rest on Bayes' theorem, and their basic assumption is that the effect of a feature on a given class is independent of the effect of any other feature on the same class. Because this is not very realistic, the method is called "naïve". Even when the assumption does not hold, it can still do a good classification job under most circumstances without overfitting, so it is often used as a baseline for initial analysis and for benchmarking other classifiers. Although not essential to our work, these algorithms are also fast and commonly used in real-time pipelines over large amounts of data.

The likelihood of the features is assumed to be Gaussian. The parameters of each class-conditional Gaussian (the per-class mean and variance) are estimated using maximum likelihood. The general expression is shown below.

image.png

Our features have considerably different magnitudes (e.g. sections can take values in the hundreds while danceability's maximum is one) and different units of measurement (e.g. loudness vs. tempo), so we decided to standardize them to zero mean and unit standard deviation.
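This standardization is one scikit-learn call; the two-column array below is a hypothetical stand-in for features on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (e.g. sections vs. danceability).
X = np.array([[120.0, 0.2],
              [180.0, 0.8],
              [150.0, 0.5]])
X_std = StandardScaler().fit_transform(X)   # zero mean, unit std per column
```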

The function forward_selectedCV was modified to fit the new requirements: the One-Hot Encoding of non-ordinal categorical features.

1.1.3. Decision Tree

In this work we use the optimized version of the CART algorithm available in Scikit-learn. Because we have many numerical attributes, this algorithm conveniently supports multiple splitting points for those attributes. We have no missing values, and the dataset is well balanced, especially in the 'hit' target. No re-scaling is usually needed for building decision trees, so we leave the magnitude and distribution of the features as they are. The implementation offers the most common selection measures, Gini and entropy (information gain); the Gain Ratio of C4.5, which attempts to overcome some bias of Information Gain, is not available and is not used. A drawback of Scikit-learn's CART is that it cannot handle categorical features natively, so we have to one-hot encode them. Finally, it also supports multiclass decision trees, which we need for the Discrete Valence classification task.
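The setup above can be sketched in a few lines; the synthetic data is a hypothetical stand-in for the track features, and the pruning values mirror the knobs discussed later:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the track features; CART with Gini and shallow
# depth plus a minimum leaf size, the pruning knobs used in this work.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3,
                              min_samples_leaf=20, random_state=0)
tree.fit(X, y)
train_score = tree.score(X, y)
```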

1.1.4. KNN

The K-Nearest Neighbours (KNN) is a supervised machine learning algorithm. It assumes that similar data points exist in close proximity and aims to locate the closest neighbours of a new, unknown data point to figure out which class it belongs to. KNN calculates the distance from the unknown point to all points in its proximity and keeps the ones with the shortest distances. For this reason, KNN is often known as a distance-based algorithm. To classify correctly, we must first determine the value of K (the number of nearest neighbours); selecting the right value of K is crucial for a solid prediction.

Thus, we proceeded to perform two KNN classification tasks: predicting Hit or Flop and predicting Discrete Valence. For each task, we started by pre-processing the dataset (binning, encoding categorical features, scaling and normalization) and inferring the best candidate features and K values for the prediction model. After this evaluation, we finally performed the classification.
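A minimal sketch of the KNN pipeline, on synthetic stand-in data (the actual feature set and split come from the project's preprocessing; K=7 is the value chosen later for the Hit task):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the track features.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Scaling matters for KNN because the classification is distance-based.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
knn.fit(X_tr, y_tr)
acc = knn.score(X_te, y_te)
```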

1.2. Learning Classifiers

1.2.1 Binary Classification Task (Hit or Flop)

We are now able to train our classification models to predict future hits.

1.2.1.1. Logistic Regression

image.png

Once we have our feature selection, we build and validate the definitive model on the test dataset and analyze several indicators derived from the confusion matrix, such as precision, recall and F1-score.

Sometimes a trained model is overfitted: it fits the training data too closely, so when tested on another dataset its accuracy is no longer good. The model is only useful for the training data.

To avoid this situation, we can add a regularization penalty C, which limits the size of the coefficients. Some tests were done without success, because the results did not change; in the end we did not use this penalty.

1.2.1.2. Naïve Bayes

For the Naïve Bayes classifier, we split the dataset into train and test sets, then scaled and normalized both. Next, we infer the best features to be used in the Naïve Bayes classification model:

1.2.1.3. Decision Tree

A simple binary decision tree with max_depth = 3, for visualization purposes.

Despite the complexity of the full tree, the upper splits shown here will be the same in our following models, unless we use a different selection measure. We can see right away that instrumentalness is important in our trained classifier, showing up at the top of the tree; acousticness and danceability also appear near the top. Where the selection measure places the thresholds is also interesting: instrumentalness is split very low, at 0.046, and it appears several times even in this simple tree.


The results are below expectations. Next, we ran several grid searches to optimize the hyper-parameters of the tree.

Over-complex trees do not generalize well: too little pruning and the tree overfits, too much and it does not learn enough. We found some instances where the training score was close to 1, so we tried to avoid those situations. We tried both available selection measures; they gave comparable results, so we opted for Gini because it is slightly faster. We then tried different pruning measures and found that some gave equivalent results: for example, pruning the tree with min_samples_leaf and max_depth gave results comparable to using ccp_alpha. In the end we opted for the min_samples_leaf and max_depth combination, which has a more immediate meaning to us. The min_samples_leaf criterion seemed to be the most impactful, so our best hyperparameters were tuned mostly around it.
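The grid search over those pruning knobs can be sketched as follows; the data and grid values are hypothetical stand-ins for the project's:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the track features; hypothetical grid values.
X, y = make_classification(n_samples=300, random_state=0)
grid = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=0),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 10, 50]},
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_   # pruning combination with the best mean CV score
```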

1.2.1.4. KNN

Considering the results obtained in the previous step, we agreed that the best results were achieved with K=7 (by a tradeoff between the number of features and classifier performances).

1.2.2 Results and discussion for the binary classification task

1.2.2.1. Logistic Regression

Interpretation of the confusion matrix:

The three indicators show good levels, since they are all above 70%.

Another concern was overfitting. Thanks to the careful application of the statistical techniques described, we do not have that problem: the difference between the accuracies computed on the train and test datasets is minimal, and the same holds for the confusion matrices and their derived indicators (precision, recall and F1-score).

1.2.2.2. Naïve Bayes

image-2.png

The best selected features for the NB classification model were:

Thus, from the confusion matrix we can infer that the NB model correctly predicted 2774 tracks as flops and 4482 as hits. The prediction of the flop class was better supported by the metrics of the classification report than the hit class.

Because the NB model has essentially no hyperparameters to calibrate, the obtained results were worse than expected. On the other hand, it is a classification model which runs very fast.

1.2.2.3. Decision Tree

We validated the chosen classifier and hyperparameters with our test set. The results are comparable to those obtained when we did the grid search. The tree shows a reasonable score for the task at hand.

image.png

The model shows a good performance: in the confusion matrix we see that it detects more than 4000 true Hits. We obtain a precision of 73% and a recall of 82%; taking the harmonic mean of these two indicators gives an F1-score of 77%.

1.2.2.4. KNN

Inferring the results of the KNN classification model to predict Hit and Flop, the best selected features are:

The train/test split is used only to check whether the classification model is overfitting, so we will evaluate only the validation sets (X_val and y_val).

According to the confusion matrix, we correctly predicted 3710 tracks as flops and 4584 as hits. From the classification report we conclude that the model is a very good predictor of hits, as seen in the recall of 90% (on hits); the precision is also good, at 77%.

1.2.2.5 ROC and Precision-Recall Curves Logistic Regression - Hit prediction

We also test the quality of our classification by drawing two plots and calculating their Areas Under the Curve (AUC): the ROC AUC and the Precision-Recall AUC.

The ROC curve is mainly used to test the discrimination power of a binary classifier. It is created by plotting the true positive rate against the false positive rate at various threshold settings. The true positive rate is the same as recall, while the false positive rate is the number of false Hit predictions over the number of real flops.

The higher these areas, the better the classification.

image-2.png

The Area Under the ROC Curve is 0.80, which indicates a high classification power.

On the other hand, the area under the Precision-Recall curve (0.66) is not as high as the ROC AUC, but it is enough to conclude that the model classifies well.
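Both summary numbers come from scikit-learn metrics; a minimal sketch with hypothetical labels and scores (average precision is the usual single-number summary of the PR curve):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical labels and predicted Hit probabilities.
y_true  = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2])

roc_auc = roc_auc_score(y_true, y_score)
pr_auc  = average_precision_score(y_true, y_score)  # PR-curve summary
```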

Threshold Definition

As explained above, it is very important to analyze the best cut-off above which a song is considered a Hit. By default, scikit-learn sets this limit to 0.5: when the predicted Hit probability is greater than 0.5, the estimator labels that track as a Hit. From there it computes the indicators analyzed before selecting the most significant features, such as accuracy, precision, recall and F1-score.

But the business requirements are more demanding than 0.5: when a radio requires only hits, that means only hits. So we must maximize precision, the percentage of predicted Hits that are real Hits; it tries to minimize false Hits. Recall, in turn, measures how many of all the real Hits the model identifies; it tries to minimize missed Hits. If instead we were hired by a radio whose priority was never to miss a Hit, we should maximize recall.

So, we developed a function that lets us try different cut-offs and find the one that best meets our requirements.
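The core of such a function is a comparison against the cut-off; a minimal sketch (not the project's function, with hypothetical probabilities standing in for `predict_proba(...)[:, 1]`):

```python
import numpy as np

def classify_with_threshold(proba_hit, threshold=0.5):
    # Label a track Hit when its predicted Hit probability reaches the cut-off.
    return (np.asarray(proba_hit) >= threshold).astype(int)

# Hypothetical Hit probabilities.
proba = [0.55, 0.65, 0.72, 0.81, 0.95]
default_labels = classify_with_threshold(proba)        # 0.5 cut-off: all 5 pass
strict_labels  = classify_with_threshold(proba, 0.7)   # stricter: only 3 pass
```

Raising the threshold accepts fewer tracks, which is exactly the precision/recall trade-off explored in the scenarios below.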

image.png

Test results:

Threshold = 0.6

Threshold = 0.7

Threshold = 0.8

If we take the strictest scenario, threshold = 0.8, the precision is very high: only 16.6% of the accepted tracks are not real Hits. But the recall is too low, only 24%, so 76% of the Hits would be lost and gifted to the other radios. Additionally, out of the entire test set (N = 10208), only 15% of the tracks would be accepted.

In conclusion, the most prudent solution, at least at the beginning, is the second scenario. With a threshold of 0.7, the precision is still very high, 78.8%, and the recall increases by 20 percentage points (up to 45%). The variety of music the radio can play is also much larger, around 30% of the proposed tracks.

1.2.3 Multinomial Classification Task (Discrete Valence)

In the next step we build models that predict the feature discrete valence. The main difference from the previous models is that the target variable has 3 levels: it is multinomial.

The first thing we do is rebalance discrete valence and split the data into train and test sets.

1.2.3.1. Logistic Regression

1.2.3.2. Naïve Bayes

Here the stratification is an even more important step, because the new target discrete_valence is not perfectly balanced.

1.2.3.3. Decision Tree

A simple multinomial decision tree with max_depth = 3, for visualization purposes.

image.png

The tree scored slightly lower on the multinomial task, but the drop from one task to the other was smaller than with the other models. The grid search gave results similar to the binary classification tree when tuning the hyper-parameters.

1.2.3.4. KNN

This time, to predict Discrete Valence, we choose K=13 as the best value for K to be used in the KNN classification model.

For K=13

1.2.4 Results and discussion for the multinomial classification task

1.2.4.1. Logistic Regression

image.png

The test accuracy is around 0.61, close to the training accuracy, so we discard overfitting. These indicators are lower than for the Hit prediction, especially for the Neutral class, whose precision and recall are below 50%. However, the model is a good predictor for Negative and Positive.

1.2.4.2. Naïve Bayes

image.png

image.png

This model had the weakest results and the largest drop in score between tasks.

For this classifier the selected features are:

Although the performance of this classifier is lower than the others, we can see some resemblance to the features selected by the other classifiers for valence. From the classification report we see that the F1-score is lower for Neutral valence and higher for Positive valence; precision and recall follow a similar trend, although Negative valence has a higher value. Here we consider all valence classes of equal importance, but this model could still be useful if we wanted to give preference to the prediction of Positive valence.

The confusion matrix shows in more detail which targets are best captured by our classifier. We can see, for example, that true Positive tracks are rarely classified as Negative, much less often than the inverse, meaning once again that the classifier is best at finding Positive valence tracks.

1.2.4.3. Decision Tree

Here we validate the chosen classifier and hyperparameters on our test set. The results are similar to those of the grid search, which is positive.

image.png

image.png

The results for the decision tree are poorer than for the binary classification, following the trend of the other classifiers. Although the performance on this multinomial task is considerably lower, the optimized hyperparameters of the tree are very close to the ones found in the binary classification.

The prediction of discrete valence through the decision tree is good overall, but the Neutral class is not well classified. Recall shows a similar situation: the two important values of discrete valence, Negative and Positive, are well predicted. The confusion matrix above confirms that misclassifications are more frequent in the Neutral class.

1.2.4.4. KNN

Evaluating the results of the classification model to predict Discrete Valence, we infer that the best selected features were:

The confusion matrix relates the true and predicted labels for all classes of the target Discrete Valence (Negative, Neutral and Positive). Interpreting the results, supported by the classification report, we can conclude that the KNN model:

Task 2 (Unsupervised Learning) - Characterizing Tracks

2.1. Introduction

Sometimes these models are not perfect and need some help from other tools, such as rules. It is common for a track with a high Hit probability to fail after some time. For example, a track longer than 10 minutes is unlikely to be a Hit, even if its danceability and energy are the best, so applying exclusion rules before taking the final decision on whether a song will be a Hit is recommended.

2.2. Preprocessing Data for Association Rule Mining

Association rule mining uses only categorical data in which each value is considered a different item. So, the first step is to understand which features have values that may or may not have meaning in their original form.

From the table above we see that there are many features with a great number of possible values. Excluding the first three (track, artist, uri) as meta features, as already done before, most features have thousands of distinct values and are floats, presumably representing continuous variables. To ensure that the model runs in reasonable time and to ease interpretation of the results, it is therefore necessary to group/bin the values of each of these features.

Features with a relatively small number of values are key, mode, time_signature and decade, which we already considered before as categorical features with no ordinal meaning.

Hit is the target from the previous tasks and is also categorical.

Sections is the only feature that is numerical, ordered and integer, and interestingly it has a relatively intermediate number of possible values. For this feature we also use binning to improve the interpretation of results.

2.3. Finding Associations

The strength of an itemset or rule can be quantified by concrete measures, but based on those measures alone, the relative interest between different itemsets and rules is harder to assess. Whether or not a rule is interesting can be assessed either subjectively or objectively [Data Mining: Concepts and Techniques, Han et al., 2011]. Those measures are also dataset-dependent, so a good support in one dataset might be a poor one in another. Considering all this, we experimented with different combinations of arguments to find itemsets and rules of interest, based not only on the concrete numbers but also on domain knowledge (with as little knowledge of the music industry as we may have), which is essential for good association rule mining.

Below we use the previously defined functions to extract the itemsets and rules. We start by considering that the predicted targets from Task 1 (hit, discrete valence) should be part of the features in this analysis, because it would be valuable to find interesting associations involving them. By considering these features we can consolidate the notions we have of the dataset and the other tasks, by finding similarities between what we observe in this task and the others.

To bin a feature in the context of association rule mining is to make each bin a category of its own. Because of this, an increased number of bins means we have more items in our dataset and each item shows up less often. It therefore seems important to strike a balance between extracting meaningful (and actionable) rules and diluting the meaning of each feature. For this reason we reused the bins previously defined in Task 1, which were:

To get a better view of the dataset we set our thresholds low, both for the support of itemsets and for the association rules. Extracting frequent itemsets and rules:
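The three measures used throughout this task follow the standard definitions: support(A→C) = P(A ∪ C), confidence = P(C | A), lift = confidence / P(C). A minimal pandas sketch on a hypothetical one-hot transaction table (the project's extraction itself uses the previously defined functions):

```python
import pandas as pd

# One-hot transaction table; item names are hypothetical categories.
tx = pd.DataFrame({
    "mode_1":           [1, 1, 1, 0, 1],
    "time_signature_4": [1, 1, 0, 1, 1],
    "hit_1":            [1, 1, 0, 0, 1],
}).astype(bool)

antecedent = tx["mode_1"] & tx["time_signature_4"]
sup_ante = antecedent.mean()                   # support(A)
sup_rule = (antecedent & tx["hit_1"]).mean()   # support(A ∪ C)
confidence = sup_rule / sup_ante               # P(C | A)
lift = confidence / tx["hit_1"].mean()         # confidence / P(C)
```

A lift above 1 means the antecedent makes the consequent more likely than its base rate, which is the filter applied to the rule tables below.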

2.3.1 Frequent itemsets

The output above shows that over 2000 itemsets were obtained.

In the above output we show only the top 5 itemsets by support, grouped by length. We observe as expected that the support decreases for larger itemsets.

The results for itemsets of length one or two could probably be anticipated by a simpler look at the data: e.g. the majority of tracks have time_signature_4, so strong support between it and other high-count features is expected. The category mode_1 is also pervasive, so it is natural that the intersection of these two categories yields a length-two itemset with great support. Indeed, {time_signature_4, mode_1} is the most frequent length-two itemset in this dataset, appearing in ~61% of the tracks.

Looking at the higher-length itemsets, some interesting relations can be found. For example, there is one itemset of length 5 showing that low loudness, low energy, high acousticness, high instrumentalness and flop occur together a noteworthy fraction of the time (~6%).

2.3.2 Association rules introduction

Over 11000 rules were found.

In the graphs above we cross the 3 most important measures for our analysis: support, confidence and lift; in all of them lift is shown as a heatmap. To extract a good set of rules from the large amount available, we need to strike a balance between these three measures, because in general they are not all positively correlated with one another at the same time. For example, the support vs. lift graph suggests these two measures are negatively correlated: for higher values of lift the support goes to zero, and for lift above 1 we rarely see support above 0.1.

The table above shows several measures of interest mentioned previously, e.g. support, confidence and lift. Ordering these variables in different ways may reveal different rules of interest. Because support and confidence can be misleading as to the strength of an association rule, we also need to quantify the correlation between the occurrences, which can be done with lift as a measure of interestingness.

For the rest of this chapter we look for rules whose antecedent or consequent has length > 1, skipping simpler rules that could be captured by a simple correlation analysis. We also require considerable confidence, so rules below 50% are not of interest, and we want a lift > 1, meaning that the antecedent increases the occurrence of the consequent by a factor equal to the lift.

Looking at rules with longer antecedents and hit or valence as a consequent, we find, for example:

(discrete_valence_Positive, mode_1, instrumentalness_0.0, hit_1) then (hit_1), meaning that the occurrence of the former increases the occurrence of the latter by a factor of 1.10. mode_1 and instrumentalness_0.0 have higher counts than most categories, but it is interesting that a more balanced category like discrete_valence_Positive is part of this rule implying a hit.

2.3.3 Association rules exploration

In this section we look for rules under the previously mentioned constraints. We consider the most important rules for our work to be those in which hit, flop or discrete valence occur.

Rules for hit_1 as a consequent, ordered by confidence, with considerable length of the antecedent

In the table above we can find rules with great confidence and considerable lift. Rules of length 3 are already quite interesting and show good values for these interestingness measures; we resort to domain knowledge to highlight the following five rules from this set, all implying 'hit_1':

The support for these rules is around 0.05-0.10, with very good confidence (above 75%) and reasonable lift (above 1.5).

Rules for hit_1 as a consequent, ordered by lift, with considerable length of the antecedent

In the table above we repeat the process, now ordered by lift, and we can see that the rules with the best lift have lower confidence, as shown in a previous graph. We cannot aim for a lift larger than 2 if we keep our constraint of confidence above 50%. The results with acceptable confidence values show approximately the same rules as the ones highlighted in the previous filter.

Rules for hit_0 as a consequent, ordered by confidence, with considerable length of the antecedent

In the table above we again find rules with great confidence and considerable lift. Rules of length 3 are already quite interesting and show good values for these interestingness measures; we again resort to domain knowledge to highlight the following five rules from this set, all implying 'hit_0':

In this table we see more interesting facts. The top confidence and lift values are larger than when we looked for the same rules implying 'hit_1': we find confidence of nearly 1 and lifts of nearly 2. Also, the rules with the highest confidence have larger lengths of 3 and 4. From all this we may conclude that there are stronger rules for 'hit_0' than for 'hit_1'.

In the table we can also observe that the category 'discrete_valence_Negative' shows up many times in different rules, yet it is not part of the length-one rules with the highest confidence.

Rules for hit_0 as a consequent, ordered by lift, with considerable length of the antecedent

In the table above we again find rules with the highest values of lift so far, which also show good confidence. Here we observe that the consequents also appear more frequently in sets of two and three.

Rules for discrete valence as a consequent, ordered by confidence, with considerable length of the antecedent

We cannot find very good confidence values when looking for valence; the best results are around the 60% mark. From here we highlight that high 'danceability' and 'energy' imply 'discrete_valence_Positive' in several rules, and also that 'hit', when grouped with several other variables, can be considered a good antecedent for 'discrete_valence_Positive'.

Rules for discrete valence as a consequent, ordered by lift, with considerable length of the antecedent

When ordered by lift, the confidence values are generally poor, so we do not choose any rule from this table.

2.4. Association Rules - Results and Discussion

In this section we further our discussion of the previous results and their relation to other parts of this work and discuss the advantages and disadvantages of the method.

From the exploration above we conclude that the most interesting rules show up with a support of around 0.05, which was our lower threshold for support. Below this we could find more rules, but rules occurring below this mark would not be deemed very significant from our point of view.

In summary we show here the most important rules extracted:

The most interesting rules that imply a hit show up with low instrumentalness and below-median (20-40 percentile) acousticness, which may mean that music with many words and low confidence that the track is acoustic (as defined by the feature, not related to the confidence of our model) implies a hit; this could be any pop song we generally hear on the radio or MTV. Many of the found rules show low instrumentalness, meaning that music with many words shows up frequently grouped with other categories to imply a hit. Music of the 10s decade also shows up in our selected rules; the dataset is relatively balanced across decades, so this may imply that modern songs, when associated with many words and a generic time signature (4/4), imply a hit more than music from other decades. Music with duration above the median (60-80 percentile) also shows up associated with other categories and implying hit. Very high danceability, associated with common acousticness values and a common time signature, also implies hit.

On the other hand, the most interesting rules that imply a flop generally contain very low values of loudness, high values of instrumentalness (low vocals), low values of valence (discrete_valence_Negative), low danceability and low energy.

In other parts of this work we see instrumentalness as one of the most important features for the classifiers, and here we can see that it is important for rules implying both hit and flop.

Also, in the other classification analyses we find that groups associated with flop seem better defined than rules associated with hits. In our rule selection we obtained rules with the same support but with both higher confidence and lift when looking for a flop compared to a hit. This seems coherent and may imply that a flop is, in certain cases, more clearly defined in terms of its features than a hit.

Considering Association Rule Mining in itself, it is a powerful method and general concept that enables us to extract interesting relations from our data, be it shopping cart data, stock analysis or musical track information.

The downside, as mentioned in the literature, is that it is difficult to arrive at a single best conclusion for the data. It is also difficult to analyse, because a measure can be good in one dataset but bad in another. For all these reasons, a good amount of domain knowledge is necessary to decide what could really be interesting in a real-world application of the extracted value.

2.5. Introduction to Clustering

2.5.1 KMeans

K-means clustering is one of the simplest unsupervised machine learning algorithms. Its objective is to group similar data points together and discover underlying patterns by looking for a fixed number (k, the number of centroids) of clusters in a dataset. The K-means algorithm runs iteratively, starting from a set of randomly selected centroids (it is nondeterministic, meaning that cluster assignments may vary if we run the same algorithm twice on the same dataset), allocating every data point to its nearest centroid while keeping the within-cluster distances as small as possible.
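A minimal sketch of this procedure with scikit-learn, on toy two-dimensional data rather than our actual track features:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs standing in for scaled track features.
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(3, 0.3, size=(50, 2))])

# n_init restarts mitigate the sensitivity to the random initial centroids;
# a fixed random_state makes the (otherwise nondeterministic) run reproducible.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.inertia_)               # SSE of the final assignment
print(np.bincount(km.labels_))   # points per cluster
```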

2.5.2 Hierarchical

How do we calculate distances?

There are several ways to calculate a distance between two points or clusters. The most common is the Euclidean distance, computed by subtracting the coordinates of points A and B, squaring the differences, summing them, and taking the square root to obtain the absolute distance.

There is also the Mahalanobis distance, which takes the correlation between variables into account and is therefore scale-invariant.

In order to simplify the interpretation, we chose the Euclidean distance.
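For concreteness, the Euclidean distance just described can be written in a few lines:

```python
import math

def euclidean(a, b):
    # Square each coordinate difference, sum, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((0, 0), (3, 4)))  # → 5.0
```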

Which types of linkage have we got?

Once we have decided on the distance to apply, we have to determine which linkage is most appropriate for our data and purpose.

The most important linkage methods are: Single, Complete, Centroid, Average and Ward.

Single linkage calculates the distance between the closest points of Cluster1 and Cluster2. These distances are calculated for every pair of clusters, and the two clusters whose distance is smallest are joined. Complete linkage is similar to single linkage, except that it calculates the distance between the most remote points of the two clusters. Centroid linkage does the same but uses the middle point (centroid) of each cluster.

The average method computes the distances between all pairs of points from the two clusters and takes their average. After these distances have been estimated for each pair of clusters, the two with the lowest value are joined.

Ward's method is based on the within-cluster variance. For every candidate pair of clusters we consider their merge, and the merge that implies the lowest increase in total within-cluster variance is carried out.

The team has decided to use the Ward linkage method because of its robustness.

Hierarchical Cluster analysis

Next, the team developed a cluster analysis following the agglomerative method. The way we build a hierarchical clustering is quite different from K-means: at the beginning we consider each observation an individual cluster, then in each iteration we join the two closest ones. We repeat the process until we reach the desired number of clusters, or until all the instances are in the same group.

First of all, the distance matrix is computed with the intention of finding the closest clusters. Then clusters A and B are joined into a single cluster C. Next, the distance matrix is recalculated, and the process repeats until we reach the desired number of clusters.
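The agglomerative procedure with Ward linkage can be sketched with SciPy on toy data (illustrative blobs, not our dataset):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Three well-separated blobs of 20 points each.
X = np.vstack([rng.normal(c, 0.2, size=(20, 2)) for c in (0, 2, 4)])

# Build the full merge tree with Ward linkage on Euclidean distances...
Z = linkage(X, method="ward")
# ...then cut it at the desired number of clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])  # sizes of the three recovered clusters
```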

2.5. Preprocessing Data for Clustering

2.5.1 - Pre-processing for K-means Clustering

We scaled our numerical features dataframe with a StandardScaler(). This is an important step for most distance-based machine learning algorithms, because feature scaling can have a significant impact on their performance.
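As a quick check of what StandardScaler does, on toy numbers rather than our features: each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy columns on very different scales.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
Xs = StandardScaler().fit_transform(X)

print(Xs.mean(axis=0))  # ≈ [0, 0]
print(Xs.std(axis=0))   # ≈ [1, 1]
```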

In a first approach, we proceed to find the appropriate number K of centroids to be used in our clustering. In the next section, we performed the elbow method and, in addition:

It is important to note that only numerical features were used:

2.6. Finding Groups

2.6.1 KMeans

We performed the elbow method, which runs K-means for values of K up to 20, incrementing K with each iteration, and records the SSE. When we plot the SSE as a function of the number of clusters, it decreases as we increase K: as more centroids are added, the distance from each point to its closest centroid decreases.

There is a sweet spot where the SSE curve starts to bend known as the elbow point. The x-value of this point is thought to be a reasonable trade-off between error and number of clusters.
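The elbow procedure described above amounts to a short loop; a sketch on toy data, using an assumed range of k = 1..8 rather than 20 for speed:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three blobs, so the SSE curve should bend around k = 3.
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0, 3, 6)])

# Record the SSE (inertia) for each candidate number of clusters.
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 9)]
# SSE decreases as k grows; the elbow is where the drop flattens.
print([round(s, 1) for s in sse])
```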

After performing the elbow method and the KneeLocator, both approaches recommended using 8 clusters. Since the shape of the SSE curve is not clear-cut, we additionally computed the silhouette coefficient (for values of K up to 20), which is a measure of cluster cohesion and separation.

The silhouette coefficient quantifies how well a data point fits into its assigned cluster based on two factors: how close the data point is to other points in the cluster and how far away the data point is from points in other clusters.
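scikit-learn exposes this measure directly as `silhouette_score`; a sketch on toy data (illustrative blobs, not our features):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(4, 0.3, size=(50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Mean silhouette over all samples: near 1 for tight, well-separated clusters.
score = silhouette_score(X, labels)
print(round(score, 2))
```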

Based on the silhouette coefficient plot, in a first approach we tested the model with 4 clusters. However, the results were not satisfactory, so we proceeded with 8 clusters in the K-means model.

After fitting and predicting the K-means model with 8 clusters, we found that the lowest SSE value was 273010.50, reached in 25 iterations. Lower SSE values indicate that samples are closer to their own cluster centroids than to other clusters.

Finding Similar Groups of Tracks

To try to characterize and find similar groups of tracks, we performed a Principal Component Analysis.

PCA

In order to decide how many principal components to reduce our data into, we plotted the cumulative explained variance.

The slope is steep up to the second component, so we chose two principal components.
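The cumulative-explained-variance computation can be sketched as follows, on toy data where two directions dominate (standing in for our scaled features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Four columns: two with large variance, two with near-zero variance.
X = np.column_stack([3.0 * rng.normal(size=200),
                     2.0 * rng.normal(size=200),
                     0.1 * rng.normal(size=200),
                     0.1 * rng.normal(size=200)])

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# Pick the number of components where the cumulative curve flattens.
print(np.round(cumvar, 3))
```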

2.6.2 Hierarchical

How many cluster do we need? Elbow Method

As in the K-means case, we don't know in advance the number of clusters that optimises the grouping. So, we need to compute some indicator to understand how many clusters we need.

One of the most common indicators is the elbow method. It calculates the SSE (Sum of Squared Errors) for models with different numbers of clusters. We want to minimise the SSE, and the higher the number of clusters, the lower the SSE; but with too many clusters the interpretation becomes hard. In conclusion, we stop adding new clusters when the SSE reduction is no longer significant.

The function elbow_clus calculates the SSE for models with 1 to 7 clusters. Where we find the elbow, we fix the optimal number of clusters.

image.png

Silhouette Coefficient

The silhouette coefficient is useful to measure how well the clusters are separated from each other and how concentrated they are internally. It ranges from -1 to 1: the higher the coefficient, the better the clustering.

The formula follows, where:

a: mean of the intra-cluster distances (to the other points in the same cluster)

b: mean of the inter-cluster distances (to the points in the nearest other cluster)
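Reconstructing the standard silhouette formula for a single point from the definitions of a and b above:

```latex
s = \frac{b - a}{\max(a, b)}
```

A value near 1 means the point is much closer to its own cluster than to the neighbouring one; a negative value suggests it may be assigned to the wrong cluster.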

It can also be used to determine the optimal number of clusters: the user builds several clusterings with different numbers of clusters and compares the coefficients.

image.png

2.7. Clustering - Results and Discussion

2.7.1 KMeans

Using the dataframe resulting from merging the principal components with the scaled numerical features and targets (hit and discrete_valence), we produced PCA plots in order to characterise the clusters. We start by plotting an overall PCA chart where the clusters are coloured, so they can be easily distinguished.

Then we plot the targets vs. the clusters: Hit vs. Clusters and Discrete Valence vs. Clusters.

Finally, for the numerical features we plot an independent PCA chart for each one versus the clusters, coloured by the feature. However, only the PCA plots for the features with relevant results are shown.

A) Overall Clusters

B) Hit Vs Clusters

C) Discrete Valence Vs Clusters

D) Danceability Vs Clusters

E) Energy Vs Clusters

F) Acousticness Vs Clusters

G) Instrumentalness Vs Clusters

2.7.2 Hierarchical

From which features do we define the groups?

The set of variables at our disposal is too large to consider all of them for grouping the tracks, so we looked for ways to select the most relevant features.

The team ran several tests with different combinations of variables and numbers of clusters, and finally chose to build a model with energy and danceability. The relation between these two variables is very interesting: danceability increases with energy until it reaches a point at which danceability drops again.

image.png

image-3.png

The graph above shows the distribution of the clusters. The sizes of the three categories are quite balanced. The blue group has the lowest values of energy and danceability. The green group has medium and high levels of energy but not of danceability, which is never large. Finally, the red group has the highest values of danceability.

In the table we can also analyse the ratio of Hits. The blue group has low levels, so most of its tracks are flops. The red and green groups are more balanced. This distribution is discussed further below.

The silhouette coefficient is not very high (values above 0.5 are usually considered good), but given the size and concentration of our data it is very difficult to obtain well-separated clusters, so reaching higher values of the coefficient is not possible.

image.png

Differences between groups

In order to be sure that our cluster model discriminates the tracks well, we tested whether the distribution of Hit differs between the three new groups.

The test we carried out is the Chi-squared test of independence.

The test compares the observed values against the expected values. The observed and expected cross tables follow.
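The test itself is one call to SciPy; the contingency table below uses hypothetical hit/flop counts per cluster, not our actual figures:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = clusters (blue, green, red), cols = (hit, flop).
observed = np.array([
    [30, 170],
    [90, 110],
    [120, 80],
])

chi2, p, dof, expected = chi2_contingency(observed)
print(dof)       # (3 - 1) * (2 - 1) = 2 degrees of freedom
print(p < 0.05)  # True here: reject independence between cluster and hit status
```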

image-2.png

image.png

image.png

image.png

2.8 Results and discussion for Clustering

2.8.1 KMeans

Comparing the overall cluster PCA plot (2.7.1-A) with the Hit vs. Clusters (2.7.1-B) and Discrete Valence (2.7.1-C) PCA plots, we can infer that cluster 1 represents the hit tracks with positive discrete valence, while cluster 3 is characterised by the aggregation of flop tracks with negative discrete valence. Therefore, a hit track is directly associated with a positive discrete valence and a flop track with a negative discrete valence.

Analysing the respective PCA plots of the features, some feature values can predict whether a track will be a hit, and at the same time predict a positive discrete valence:

In contrast, several feature values can predict whether a track will be a flop, and at the same time predict a negative discrete valence:

3. Final Comments and Conclusions

In conclusion, the data science radio team toured through several data analysis and machine learning techniques. Upon receiving the data, the first thing to do was to understand what it contains and what kind of information it holds. For that, the team produced a report with the exploratory and descriptive analysis; Pearson's correlation coefficient was also deployed. Next, the team addressed the prediction of the variable Hit. Recall that four models were built: Logistic Regression, Naive Bayes, Decision Tree and KNN. The results of the four models show both differences and similarities. The four models share three variables: instrumentalness, danceability and acousticness. Energy is also included in Logistic Regression, Naive Bayes and KNN; Naive Bayes is the only model with liveness; and time_signature and key are only significant in Logistic Regression. These differences are not necessarily good or bad, but they induce strengths and weaknesses among the models. One good thing they all have in common is that they are stronger in recall (the reason this is good is discussed below). The performance of every model is good, with accuracy greater than 73% in all of them; precision, recall and F1 score also show predictive power. However, the best classifier we built is KNN: its recall is up to 90%, and it also has the highest precision (77%) and accuracy (82%). Logistic Regression was also validated through the AUC (Area Under the Curve) of the ROC and Precision-Recall curves. The ROC AUC shows more than 80% classification quality, while the Precision-Recall AUC was weaker but still acceptable (62%). Finally, a new cut-off was designed to determine from which probability of occurrence the team considers a track a Hit. For that, a simulation with three scenarios was run on the Logistic Regression model, and the differences between them were shown through the same indicators as above.

This analysis focused on minimising false positives, to reduce the number of Flops aired. The indicator that best captures this requirement is recall. By increasing the threshold we make it harder for a track to be classified as a Hit, thus eliminating many Flops predicted as Hits; in consequence, some Hits are classified as Flops, so we lose precision. The final decision was to increase the threshold to 0.7, reaching 78% recall. Next, the prediction of discrete valence has several differences from the Hit prediction. For instance, the dependent variable has three levels instead of two, which complicates both the construction and the validation. The same four models were built. In general, they do not have the predictive power seen for Hits: the accuracies obtained are between 60% and 65%, and the neutral class in particular did not reach 50% precision in Logistic Regression. On the other hand, three unsupervised models were developed: one rule mining analysis and two cluster analyses, K-means and hierarchical. The way they work differs, and so do the perspectives and uses we gave them. Rule mining uses indicators such as support and lift to associate observations, based on the co-occurrence of items in the same tuple. The main difference between the two clustering methods is how they are built: hierarchical clustering starts from zero, with every instance considered a cluster, which are then joined by proximity; K-means instead starts with N centroids to which the closest points are assigned. K-means was complemented with a PCA analysis, in order to visualise in a single graph all the information of the variables combined with the clustering. Several differences were found thanks to this analysis, such as the negative influence of instrumentalness on Hit. Agglomerative clustering defined 3 groups of tracks as a function of energy and danceability. Using only two variables makes it easier to explain to an artist what a Hit is.

This is also because both concepts are very clear. Still, clustering only tells us what the clusters look like, not directly about the Hits, so the analysis continued. Hypothesis tests were designed to confirm the differences found in the hierarchical clustering in terms of Hit. The goal was to know whether having high (or low) values of energy and danceability affects the Hit distribution. The test applied was the Chi-squared test of independence. The differences between the observed and expected values were huge, so the team can reject the null hypothesis of independent distributions. So yes: the higher the danceability and energy (up to a limit), the higher the probability of a Hit. The main benefit of the entire project is to help artists understand how they should make their music if they want to be successful. Instrumentalness and acousticness are related to Flops rather than Hits, and speechiness is not favourable to Hits either. On the other hand, the remaining variables are more related to Hits. Some conclusions would go further if other information were included, such as genre, country of origin or number of band members. In any case, working on this project has helped us understand the use and combination of several machine learning techniques, as well as some characteristics of the music industry, such as the measures used to understand music. Are they the music notes of the 21st century?